Emergent behaviors are at the focus of recent research interest. It is therefore of considerable
importance to investigate which optimizations suit the learning and prediction of chaotic systems, the
putative candidates for emergence. We have compared $L_1$ and $L_2$ regularizations on predicting
chaotic time series using linear recurrent neural networks. The internal representation and the
weights of the networks were optimized in a unifying framework. Computational tests on different
problems indicate considerable advantages for the $L_1$ regularization: it had considerably better
learning time and better interpolating capabilities. We shall argue that optimization viewed as
a maximum likelihood estimation justifies our results, because $L_1$ regularization fits heavy-tailed
distributions - an apparently general feature of emergent systems - better.
I. INTRODUCTION

We are interested in learning, representing, and predicting emergent phenomena. The general view is that emergent
dynamics brought about by nature (see, e.g., [?] and references therein) and also, in some man-made constructs
(see, e.g., [?] and references therein) show chaotic behavior and exhibit heavy-tailed distributions. In turn,
we are interested in learning, representing, and predicting chaotic processes. Such processes could be of considerable
importance in neurobiology. According to the common view, the brain may apply chaotic systems to represent and
to control nonlinear dynamics [?]. Our study concerns the identification of chaotic dynamics using recurrent linear
neural networks, a particular family of artificial neural networks (ANN).

Considering ANNs, sparsified representations have been of considerable research interest and have shown reasonable
success in representing natural scenes [?]. The underlying concept is to choose a representation that can derive
(generate, reconstruct) the input by using the least number of components, that is, one that compresses information
efficiently. From the theoretical point of view, this assumption may be seen as a variant of Occam's razor principle
(see, e.g., [17] and references therein), provided that each component has identical a priori probabilities. For a
recent review on the neurobiological relevance of sparse representations, see [?].

Numerical studies indicate that creation and pruning of connectivity matrices of ANNs can be advantageous: it may
decrease learning time and may increase generalization capabilities (see, e.g., [?]). Encouraging experimental
studies of joint representational and weight sparsifications have also been undertaken in generative networks [22].
For a thorough review on weight pruning methods, see [23] and references therein. Weight sparsification is connected
to structural risk minimization [24], a method that provides a measure for generalization versus overfitting. Structural
risk minimization is also related to certain regularization methods (see, e.g., [17, 24, 25] and references therein).

We intend to demonstrate the advantages of sparsification. For the sake of simplicity we shall deal with linear
systems. This simplification allowed us to build a joint framework, which can treat sparse neural activities and
sparse neural connectivity sets on equal footing. The framework was constructed to compare the performances of the
$L_1$ and $L_2$ norms. The chosen 'battlefield' is the identification of chaotic time series. This highly nonlinear
problem should be a real challenge to linear schemes. In turn, we also ask how good an identification can be
achieved using linear approaches.

The paper is organized as follows: Basic concepts and the recurrent neural network to be studied are introduced
in Section II. The mathematical formalism and the approximations are described in Section III. Section IV reviews
some numerical experiments. Connections to information theory are discussed in Section V. Conclusions are drawn in
the last section.
II. PRELIMINARIES

A. Network architecture

Neural architectures with reconstruction abilities shall be considered. In particular, our formalism concerns
Elman-type recurrent networks, which belong to the family of recurrent neural networks (RNNs). For a review on RNNs,
see, e.g., [26] and references therein.

The Elman network has $d_x$ input neurons, $d_u$ hidden neurons, and $d_x$ output neurons. The state, or activity, of
each neuron is represented by a real number in all layers. Activities of input, hidden, and output neurons are indexed
by time: quantities in the $t$-th time instant are $\mathbf{x}_t \in \mathbb{R}^{d_x}$, $\mathbf{u}_t \in \mathbb{R}^{d_u}$,
and $\mathbf{o}_t \in \mathbb{R}^{d_x}$, respectively. Typical artificial neurons have a bias term. For the sake of
notational simplicity and without loss of generality, this bias term is suppressed here. The bias term can be induced
by setting one of the coordinates to 1 for all times. The dynamics of our RNN is as follows:

$$\mathbf{u}_{t+1} = \mathbf{F}\mathbf{u}_t + \mathbf{G}\mathbf{x}_t, \tag{1}$$
$$\mathbf{o}_{t+1} = \mathbf{H}\mathbf{u}_{t+1}, \tag{2}$$
$$\mathbf{x}_{t+1} = \mathbf{o}_{t+1}, \tag{3}$$

where $\mathbf{F} \in \mathbb{R}^{d_u \times d_u}$, $\mathbf{G} \in \mathbb{R}^{d_u \times d_x}$, and $\mathbf{H} \in \mathbb{R}^{d_x \times d_u}$.
In these equations:

• Eq. (1) describes the dynamics of the hidden state of the RNN; this is the state equation,

• Eq. (2) is the observation equation, which provides the mapping of the hidden state to the external world, and

• Eq. (3) denotes our approximation that the output of the RNN will be used to estimate the input in the next time
instant. This is the approximation equation.

In what follows, the linear dynamical system of Eqs. (1)-(3) will be referred to as linear recurrent neural network
(LRNN).
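
The recursion of Eqs. (1)-(3) can be sketched in a few lines of NumPy. This is only an illustration: the function name `lrnn_rollout`, the matrix values, and the initial conditions are assumptions made for the example and are not taken from the experiments of the paper.

```python
import numpy as np

def lrnn_rollout(F, G, H, u0, x0, steps):
    """Iterate Eqs. (1)-(3): the output is fed back as the next input."""
    u, x = u0, x0
    outputs = []
    for _ in range(steps):
        u = F @ u + G @ x      # Eq. (1): state equation
        o = H @ u              # Eq. (2): observation equation
        x = o                  # Eq. (3): approximation equation
        outputs.append(o)
    return np.array(outputs)

# Illustrative sizes only (d_u = 4, d_x = 1); the matrices are arbitrary.
rng = np.random.default_rng(0)
d_u, d_x = 4, 1
F = 0.5 * rng.uniform(size=(d_u, d_u)) / d_u
G = rng.uniform(size=(d_u, d_x))
H = np.eye(d_x, d_u)           # projection to the first hidden coordinate
preds = lrnn_rollout(F, G, H, np.zeros(d_u), np.ones(d_x), steps=10)
```

With a zero initial hidden state and unit input, the first output equals the first entry of $\mathbf{G}$, which gives a quick sanity check of the wiring.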

B. Problem domain 

Consider time series $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_T\}$. We assume that the identification of the
Elman-type network and the approximation of the reconstructing hidden states may be improved if both the hidden
representation and the parameters of the Elman network are subject to sparsification. Optimization via sparsification,
that is, via the $L_1$ norm and the related $\epsilon$-insensitive norm [27], will be examined with respect to
optimization via the quadratic $L_2$ norm. Optimization concerns both the hidden representation, that is $\mathbf{u}_t$,
$t = 1, 2, \ldots$, and the weights, that is matrix $\mathbf{F}$. The linear recurrent network assumption allows for
a joint formalism and shall be tested on different time series.

C. Notations 

Let us introduce the following notations:

Letter types: Numbers ($b$), vectors ($\mathbf{b}$), and matrices ($\mathbf{B}$) are distinguished by the letter types.
Vector means column vector here.

Special vectors and matrices: Certain letters denote particular vectors and matrices. Beyond the already introduced
notations ($\mathbf{x}_t$, $\mathbf{u}_t$, $\mathbf{o}_t$, $\mathbf{F}$, $\mathbf{G}$, and $\mathbf{H}$), $\mathbf{e}$
and $\mathbf{E}$ denote a vector and a matrix with all components equal to 1, respectively; $\mathbf{I}$ is the identity
matrix; $\mathbf{U}$ and $\mathbf{X}$ are matrices with column indices corresponding to time and representing the whole
hidden and input time series, respectively; $\mathbf{R}$ is the $\epsilon$-insensitive multiplier matrix (see also
matrix norms).

Operations on matrices: Matrices are subject to the following operations:

• $\mathbf{M}^T$ denotes the transposed form of the matrix.

• $tr(\mathbf{M})$ is the trace.

• $vec(\mathbf{M})$ is the matrix in column vector form. The columns of the matrix follow each other in this vector.

• $\mathbf{A} \otimes \mathbf{B}$ denotes the Kronecker product of the matrices: $\mathbf{A} \otimes \mathbf{B} := [a_{ij}\mathbf{B}]$.

Matrix relations: Relations between matrices, such as $>$, $\geq$, and so on, concern all coordinates.
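Norms of matrices: The usual $\epsilon$-insensitive and Euclidean norms can be extended to matrices:

• $\|\mathbf{M}\|_{\mathbf{R}} := \sum_{i,j} |m_{ij}|_{\epsilon_{ij}}$, where $|m|_{\epsilon}$ is the $\epsilon$-insensitive cost function, i.e.,

$$|m|_{\epsilon} := \begin{cases} 0, & \text{if } |m| \leq \epsilon; \\ |m| - \epsilon, & \text{otherwise}, \end{cases}$$

and $\mathbf{R} = [\epsilon_{ij}]$.

• $\|\mathbf{M}\|_{\mathbf{K}} := tr(\mathbf{M}^T \mathbf{K} \mathbf{M})$ is the norm of the matrix induced by positive definite matrix $\mathbf{K}$.

For the case of vectors, the choices $\mathbf{R} = \mathbf{0}$ and $\mathbf{K} = \mathbf{I}$ simplify to the well known
$L_1$ and $L_2$ norms, respectively. For the sake of simplicity and when no confusion may arise, the $L_1$ and $L_2$
norms will be denoted by $\|\cdot\|_1$ and $\|\cdot\|_2$, respectively.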
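
The two matrix norms defined above can be written down directly in NumPy. The sketch below is illustrative only; the helper names `eps_insensitive_norm` and `k_norm` and the sample vector are assumptions of this example. Note that, as defined, $\|\mathbf{M}\|_{\mathbf{K}}$ with $\mathbf{K} = \mathbf{I}$ evaluates to the sum of squared entries.

```python
import numpy as np

def eps_insensitive_norm(M, R):
    """||M||_R: sum over entries of max(|m_ij| - eps_ij, 0), with R = [eps_ij]."""
    return np.maximum(np.abs(M) - R, 0.0).sum()

def k_norm(M, K):
    """||M||_K = tr(M^T K M) for a positive definite matrix K."""
    return np.trace(M.T @ K @ M)

# A small column vector: one entry below the threshold 0.05, two above it.
m = np.array([[0.03], [-0.2], [1.0]])
R_eps = 0.05 * np.ones_like(m)   # R = 0.05 E
R_zero = np.zeros_like(m)        # R = 0 recovers the L1 norm

eps_value = eps_insensitive_norm(m, R_eps)
l1_value = eps_insensitive_norm(m, R_zero)
l2_value = k_norm(m, np.eye(3))  # K = I: sum of squares
```

The entry 0.03 falls inside the insensitive zone and contributes nothing to `eps_value`, whereas it still contributes to the $L_1$ value.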

Linear multi-term expression: An expression of type $\mathcal{N}(\mathbf{Z}) = \sum_{i=1}^{n} \mathbf{P}_i \mathbf{Z} \mathbf{Q}_i + \mathbf{B}$
is called a linear multi-term expression in matrix $\mathbf{Z}$.

Notation $size(\cdot)$ will be the shorthand to denote the sizes of matrices. For example, for matrix
$\mathbf{Z} \in \mathbb{R}^{a \times b}$, $size(\mathbf{Z}, 1) = a$ and $size(\mathbf{Z}, 2) = b$. Temporal shift is
denoted by operator $z$, which increases all temporal indices by one and produces $\mathbf{x}_t \to \mathbf{x}_{t+1}$
transitions. Operator $z$ may also act on the columns of matrices. For example, for matrix
$\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_T]$, $z\mathbf{X} = [\mathbf{x}_2, \ldots, \mathbf{x}_{T+1}]$.

III. FORMALISM AND DERIVATIONS 

There are three constraints that we considered when building the framework:

1. The framework is constrained to LRNN dynamics.

2. The goal of the framework is the approximation and the prediction of time series $\mathbf{x}_t$.

3. Regularization, which can ensure an unambiguous solution, shall be the tool of sparsification.

In turn, both approximation and regularization need to be considered.

A. Unified description of approximation and regularization 

Let us estimate the value of source $\mathbf{Z}$ subject to the conditions that different $l_{\theta_i}(\mathbf{Z})$,
the $\theta_i$ ($i = 1, 2, \ldots, m$) transformed values of $\mathbf{Z}$, approximate certain $\mathbf{Y}_i$ values
and also minimize certain $c_i$ cost functions. Formally the goal is

$$\sum_{i} c_i\left(\mathbf{Y}_i - l_{\theta_i}(\mathbf{Z})\right) \to \min. \tag{4}$$

This goal is illustrated by the directed graph of Fig. 1, which depicts the costs that may arise. Every edge represents
a certain transformation, and every cost is represented by a node. Some nodes are approximation nodes, i.e., represent
approximation costs, whereas others are regularization nodes and represent costs associated with the regularization.
In our case, $\mathbf{Z}$ corresponds to state sequences and the parameters of the LRNN system, as will be detailed below.
FIG. 1: Directed graph of approximation and regularization. Task: estimate unknown source $\mathbf{Z}$ by assuming
parameterized $\theta_i$ transforms, i.e., certain $l_{\theta_i}(\mathbf{Z})$ forms such that the transformed values
approximate prescribed values $\mathbf{Y}_i$ for all $i = 1, 2, \ldots$ Estimation means that the quality of
approximation as determined by the cumulated costs of each transformation $c_i$ is minimum.



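B. LRNN cost function

The following cost function fulfills our requirements:

$$f(\Theta, \mathbf{U}) := \lambda_{state} \left\| (\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X})\mathbf{C}_e - \mathbf{U}\mathbf{C}_b \right\|_{state} + \lambda_{appr} \left\| \mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X} \right\|_{appr} + \lambda_{\Theta}\, reg(\Theta) + \lambda_{U}\, reg(\mathbf{U}), \tag{5}$$

with $\lambda_{state}, \lambda_{appr}, \lambda_{\Theta}, \lambda_{U} > 0$. The state and approximation residuals are
written here in the compact matrix form of Eqs. (7) and (6), respectively.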



Here, $reg(\cdot)$ denotes regularization terms, $\Theta$ represents the parameters of the LRNN system
($\Theta := (\mathbf{F}, \mathbf{G}, \mathbf{H})$), $\mathbf{U}$ is the internal state sequence
($\mathbf{U} := [\mathbf{u}_1, \ldots, \mathbf{u}_T]$), and $\mathbf{Z}$ stands for $\Theta$ or $\mathbf{U}$.
Optimizations over $\Theta$ and $\mathbf{U}$ were iterated in alternation. This procedure ensures convergence, at
least to a local minimum.

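Compact formulation can be gained by introducing matrix $\mathbf{X}$, which represents the state sequence, and matrices
$\mathbf{C}_b$, $\mathbf{C}_e$, which shall be useful for describing temporal evolutions:

• State sequence matrix $\mathbf{X} := [\mathbf{x}_1, \ldots, \mathbf{x}_T]$,

• Matrix $\mathbf{C}_b$ (cut-the-beginning) performs the transformation

$$[\mathbf{u}_1, \ldots, \mathbf{u}_T] \to [\mathbf{u}_2, \ldots, \mathbf{u}_T],$$

whereas matrix $\mathbf{C}_e$ (cut-the-end) acts like

$$[\mathbf{u}_1, \ldots, \mathbf{u}_T] \to [\mathbf{u}_1, \ldots, \mathbf{u}_{T-1}], \qquad [\mathbf{x}_1, \ldots, \mathbf{x}_T] \to [\mathbf{x}_1, \ldots, \mathbf{x}_{T-1}].$$

That is,

$$\mathbf{C}_b := \begin{bmatrix} \mathbf{0}^T \\ \mathbf{I}_{T-1} \end{bmatrix}, \qquad \mathbf{C}_e := \begin{bmatrix} \mathbf{I}_{T-1} \\ \mathbf{0}^T \end{bmatrix}, \qquad \mathbf{C}_b, \mathbf{C}_e \in \mathbb{R}^{T \times (T-1)}.$$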

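Now, we can write the approximation property as

$$\|\mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X}\| \approx 0, \tag{6}$$

whereas the state equation assumes the following form:

$$\mathbf{U}\mathbf{C}_b = (\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X})\mathbf{C}_e. \tag{7}$$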
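
As a small numerical check of the compact state equation (7), the sketch below builds the cut-the-beginning and cut-the-end matrices and verifies that the residual vanishes for hidden states generated by Eq. (1). The helper name `cut_matrices`, the sizes, and the random data are assumptions of this example.

```python
import numpy as np

def cut_matrices(T):
    """Return (C_b, C_e), both T x (T-1): right-multiplication drops the first / last column."""
    I = np.eye(T - 1)
    zero_row = np.zeros((1, T - 1))
    C_b = np.vstack([zero_row, I])   # cut-the-beginning
    C_e = np.vstack([I, zero_row])   # cut-the-end
    return C_b, C_e

rng = np.random.default_rng(1)
d_u, d_x, T = 3, 1, 6
F = 0.3 * rng.standard_normal((d_u, d_u))
G = rng.standard_normal((d_u, d_x))
X = rng.standard_normal((d_x, T))

# Generate hidden states with u_{t+1} = F u_t + G x_t (Eq. (1)).
U = np.zeros((d_u, T))
U[:, 0] = rng.standard_normal(d_u)
for t in range(T - 1):
    U[:, t + 1] = F @ U[:, t] + G @ X[:, t]

C_b, C_e = cut_matrices(T)
state_residual = (F @ U + G @ X) @ C_e - U @ C_b   # zero when Eq. (7) holds
```

In practice one would slice columns instead of forming $\mathbf{C}_b$ and $\mathbf{C}_e$ explicitly; the matrices are useful mainly because they keep the expressions linear multi-terms.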

Equations (6)-(7) are affine conditions either in $\mathbf{U}$ for fixed LRNN parameters, or in the LRNN parameters
$\mathbf{F}$, $\mathbf{G}$, and - in a degenerate sense - also in matrix $\mathbf{H}$. Notice that both (6) and (7) are
linear multi-terms in the LRNN parameters and in the internal state, respectively, provided that the other terms are
kept constant. Thus, cost function $f$ can be iteratively optimized by alternately minimizing over $\Theta$ and
$\mathbf{U}$. The cost function decreases in every step and, upon convergence, a locally optimal parameter set and
internal state sequence will be reached, provided that the regularization terms can be managed. Let us consider the
$\|\mathbf{Z}\|_{\mathbf{K}}$ or $\|\mathbf{Z}\|_{\mathbf{R}}$ regularization expressions. Such regularization terms

• can be seen as linear multi-terms (of unit length in $\mathbf{Z}$),

• ensure uniqueness in each iterative step if $\mathbf{K} = \mathbf{I}$ and $\epsilon = 0$ (i.e., in $L_1$ norm).

The latter condition corresponds to sparsification [?, 28].

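For the sake of simplicity in the form of the equations, we shall restrict our considerations to $\Theta := \mathbf{F}$.
We shall study costs that emphasize reconstruction with sparsification:

$$f_s(\mathbf{F}, \mathbf{U}) := \lambda_{appr} \|\mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X}\|_{\mathbf{R}} + \lambda_{state} \|(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X})\mathbf{C}_e - \mathbf{U}\mathbf{C}_b\|_{\mathbf{R}} + \lambda_F \|\mathbf{F}\|_1 + \lambda_U \|\mathbf{U}\|_1. \tag{8}$$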

Cost function $f_s$ is made of $\epsilon$-insensitive terms ($\mathbf{R}$ with $\epsilon > 0$) of Support Vector
Machines (SVMs, see, e.g., [25, ?, ?] and the references therein). This separation of the $L_1$ and
$\epsilon$-insensitive norms is somewhat arbitrary. Both norms are used in order to demonstrate the diversity enabled
by the mathematical framework. Different combinations are, of course, possible, but no effort was made to find the
best combination.

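For comparison, we shall study systems equipped with the quadratic cost function $f_q$:

$$f_q(\mathbf{F}, \mathbf{U}) := \lambda_{appr} \|\mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X}\|_2 + \lambda_{state} \|(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X})\mathbf{C}_e - \mathbf{U}\mathbf{C}_b\|_2 + \lambda_F \|\mathbf{F}\|_2 + \lambda_U \|\mathbf{U}\|_2. \tag{9}$$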

C. Optimization of cost functions 

We shall simplify the description by transcribing cost functions $f_s$ and $f_q$ into vectorial form. As it has been
noted before, each term - irrespective of its $\epsilon$-insensitive or quadratic nature - can be seen as a linear
multi-term in the corresponding iteration step. Thus, it is sufficient to consider two types of costs:

$$f_1(\mathbf{Z}) := \lambda \left\| \sum_{i=1}^{n} \mathbf{L}_i \mathbf{Z} \mathbf{M}_i - \mathbf{N} \right\|_{\mathbf{R}} \to \min_{\mathbf{Z}}, \tag{10}$$

$$f_2(\mathbf{Z}) := \lambda \left\| \sum_{i=1}^{n} \mathbf{L}_i \mathbf{Z} \mathbf{M}_i - \mathbf{N} \right\|_{\mathbf{K}} \to \min_{\mathbf{Z}}, \tag{11}$$

where $\mathbf{Z}$ denotes the variable subject to optimization in the given step, i.e., $\mathbf{F}$ or $\mathbf{U}$.
Such terms can be added to form the full cost functions $f_s$ and $f_q$, respectively. We shall make use of the forms:



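$$vec(\mathbf{B}\mathbf{C}\mathbf{D}) = (\mathbf{D}^T \otimes \mathbf{B})\, vec(\mathbf{C}), \tag{12}$$

$$tr(\mathbf{B}^T \mathbf{C}) = vec(\mathbf{B})^T vec(\mathbf{C}), \tag{13}$$

$$tr(\mathbf{B}\mathbf{X}^T\mathbf{C}\mathbf{Y}\mathbf{D}) = vec(\mathbf{X})^T (\mathbf{B} \otimes \mathbf{I}_{size(\mathbf{X},1)})^T (\mathbf{D}^T \otimes \mathbf{C})\, vec(\mathbf{Y}). \tag{14}$$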
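
The three identities can be checked numerically on random matrices. The sketch below assumes the column-stacking convention for $vec(\cdot)$ stated in the Notations; the matrix shapes and variable names are arbitrary choices for this example.

```python
import numpy as np

def vec(M):
    """Stack the columns of M into one vector (column-major order)."""
    return M.reshape(-1, order="F")

rng = np.random.default_rng(0)

# Eq. (12): vec(BCD) = (D^T kron B) vec(C)
B, C, D = rng.standard_normal((2, 3)), rng.standard_normal((3, 4)), rng.standard_normal((4, 5))
lhs12 = vec(B @ C @ D)
rhs12 = np.kron(D.T, B) @ vec(C)

# Eq. (13): tr(B^T C) = vec(B)^T vec(C)
B13, C13 = rng.standard_normal((3, 4)), rng.standard_normal((3, 4))
lhs13 = np.trace(B13.T @ C13)
rhs13 = vec(B13) @ vec(C13)

# Eq. (14): tr(B X^T C Y D) = vec(X)^T (B kron I)^T (D^T kron C) vec(Y)
p, q, r, s, t = 3, 2, 4, 5, 6
X = rng.standard_normal((p, q))
B14 = rng.standard_normal((r, q))
C14 = rng.standard_normal((p, s))
Y = rng.standard_normal((s, t))
D14 = rng.standard_normal((t, r))
lhs14 = np.trace(B14 @ X.T @ C14 @ Y @ D14)
rhs14 = vec(X) @ np.kron(B14, np.eye(p)).T @ np.kron(D14.T, C14) @ vec(Y)
```

The identity matrix in Eq. (14) has size $size(\mathbf{X}, 1)$, here `p`; using a row-major `reshape` instead of `order="F"` would break all three equalities.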



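For the derivation of Eqs. (12) and (13), see [30]. The derivation of Eq. (14) as well as other details can be found in
the Appendix. We have

$$\min_{\mathbf{Z}} f_1(\mathbf{Z}) \Leftrightarrow \min_{\mathbf{y}} \mathbf{w}^T \mathbf{y}, \text{ provided that } \{\mathbf{D}\mathbf{y} \leq \mathbf{q}\}, \text{ where}$$

$$\mathbf{y} := \begin{bmatrix} \mathbf{z} \\ \mathbf{a} \\ \mathbf{a}^* \end{bmatrix}, \quad \mathbf{w} := \begin{bmatrix} \mathbf{0} \\ \lambda\mathbf{e} \\ \lambda\mathbf{e} \end{bmatrix}, \quad \mathbf{M}_L := \sum_{i=1}^{n} \mathbf{M}_i^T \otimes \mathbf{L}_i, \quad \mathbf{D} := \begin{bmatrix} \mathbf{M}_L & -\mathbf{I} & \mathbf{0} \\ -\mathbf{M}_L & \mathbf{0} & -\mathbf{I} \\ \mathbf{0} & -\mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & -\mathbf{I} \end{bmatrix}, \quad \mathbf{q} := \begin{bmatrix} \mathbf{r} + \mathbf{n} \\ \mathbf{r} - \mathbf{n} \\ \mathbf{0} \\ \mathbf{0} \end{bmatrix},$$

with $\mathbf{z} = vec(\mathbf{Z})$, $\mathbf{r} = vec(\mathbf{R})$, $\mathbf{n} = vec(\mathbf{N})$, and slack vectors
$\mathbf{a}$, $\mathbf{a}^*$, and

$$f_2(\mathbf{Z}) = \frac{1}{2}\mathbf{z}^T \mathbf{H} \mathbf{z} + \mathbf{f}^T \mathbf{z} \ (+\text{const}), \text{ where } \mathbf{z} \in \mathbb{R}^{size(\mathbf{Z},1) \cdot size(\mathbf{Z},2)}.$$

Here $\mathbf{H}$ and $\mathbf{f}$ denote the matrix and the vector of the quadratic form (see Appendix B).
Optimizations can be executed by simply collecting the representative terms, because all terms are either in linear or
in quadratic forms.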
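
A single $\epsilon$-insensitive minimization step can be sketched with an off-the-shelf linear programming solver. The example below uses `scipy.optimize.linprog`, which is an assumption of this illustration (the paper does not name a solver), and it encodes the non-negativity of the slack vectors through variable bounds instead of the last two block rows of $\mathbf{D}$. The function name `minimize_f1` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog

def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")

def minimize_f1(Ls, Ms, N, R, lam=1.0):
    """Minimize lam * || sum_i L_i Z M_i - N ||_R over Z as a linear program."""
    M_L = sum(np.kron(M.T, L) for L, M in zip(Ls, Ms))
    m, k = M_L.shape
    I, O = np.eye(m), np.zeros((m, m))
    D = np.block([[M_L, -I, O], [-M_L, O, -I]])        # first two block rows of D
    q = np.concatenate([vec(R) + vec(N), vec(R) - vec(N)])
    w = np.concatenate([np.zeros(k), lam * np.ones(2 * m)])
    # z is free; the slack variables a, a* are constrained to be non-negative.
    bounds = [(None, None)] * k + [(0, None)] * (2 * m)
    res = linprog(w, A_ub=D, b_ub=q, bounds=bounds)
    Z = res.x[:k].reshape((Ls[0].shape[1], Ms[0].shape[0]), order="F")
    return Z, res.fun

# Toy problem: N = L Z_true M plus noise that stays inside the insensitive zone.
rng = np.random.default_rng(0)
L = rng.standard_normal((5, 2))
M = np.eye(3)
Z_true = rng.standard_normal((2, 3))
N = L @ Z_true @ M + rng.uniform(-0.03, 0.03, size=(5, 3))
R = 0.05 * np.ones((5, 3))
Z_hat, cost = minimize_f1([L], [M], N, R)
```

Since the noise is smaller than $\epsilon = 0.05$, the optimal cost is zero; the recovered `Z_hat` need not coincide with `Z_true`, because every point whose residual stays inside the insensitive zone is equally good.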

IV. EXPERIMENTAL STUDIES 

In the numerical studies, advantages of the joint parameter and representational sparsification - to be referred to
as the 'sparse case' - were explored for the LRNN architecture. The error of the approximation was monitored as a
function of iteration number. The following test examples were chosen:

MG-17, MG-30: The Mackey-Glass time series was originally introduced for modelling the behavior of blood cells
[31] using a delay differential equation:

$$\dot{x}_t = \frac{a x_{t-\tau}}{1 + x_{t-\tau}^c} - b x_t.$$

This equation is one of the standard benchmark tests. We used the usual parameterizations of the Mackey-Glass
time series ($a = 0.2$, $b = 0.1$, $c = 10$, $\tau = 17$ and $30$) under sampling time equal to 6 s. The time series
was computed by the 4th order Runge-Kutta method.
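
A Mackey-Glass series with the parameters above may be generated along the following lines. This is a sketch under explicit assumptions that are not stated in the paper: the integration step `dt = 0.1`, the constant initial history `x0 = 1.2`, and the simplification that the delayed value is held fixed within each Runge-Kutta step. The final rescaling to $[-1, 1]$ mirrors the normalization used in the experiments.

```python
import numpy as np

def mackey_glass(n_samples, tau=17.0, a=0.2, b=0.1, c=10.0,
                 dt=0.1, sample_every=6.0, x0=1.2):
    """Integrate the Mackey-Glass equation with RK4 and sample every `sample_every` seconds."""
    delay_steps = int(round(tau / dt))
    stride = int(round(sample_every / dt))
    n_steps = n_samples * stride
    x = np.full(delay_steps + n_steps + 1, x0)   # assumed constant history

    def rhs(x_now, x_delayed):
        return a * x_delayed / (1.0 + x_delayed ** c) - b * x_now

    for k in range(delay_steps, delay_steps + n_steps):
        xd = x[k - delay_steps]                  # delayed value, held fixed within the step
        k1 = rhs(x[k], xd)
        k2 = rhs(x[k] + 0.5 * dt * k1, xd)
        k3 = rhs(x[k] + 0.5 * dt * k2, xd)
        k4 = rhs(x[k] + dt * k3, xd)
        x[k + 1] = x[k] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0

    series = x[delay_steps::stride][:n_samples]
    # Scale to [-1, 1].
    return 2.0 * (series - series.min()) / (series.max() - series.min()) - 1.0

mg17 = mackey_glass(100, tau=17.0)
mg30 = mackey_glass(100, tau=30.0)
```

A more careful integrator would interpolate the delayed value at the intermediate Runge-Kutta stages; for a small step size the simplification used here is a common shortcut.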

FIR-Laser: This time series is series 'A' of the Santa Fe competition on time series prediction [32]. It is the
measured intensity pattern of a far-infrared (FIR) laser in its chaotic state. The time series will be referred to as
the FIR-Laser series.

Henon: The dynamical model

$$x_{t+1} = 1 - a x_t^2 + y_t,$$
$$y_{t+1} = b x_t$$

was proposed by Henon [33]. We use standard parameters ($a = 1.4$, $b = 0.3$), generate the Henon attractor from the
pair $(x_t, x_{t-1})$, and predict $x_t$.
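
The Henon series can be produced by iterating the map directly. In the sketch below the initial condition $(0, 0)$ and the number of discarded transient steps are assumptions of this example; only the map and the parameters $a = 1.4$, $b = 0.3$ come from the text.

```python
import numpy as np

def henon_series(n, a=1.4, b=0.3, x0=0.0, y0=0.0, discard=100):
    """Return n samples of the x coordinate of the Henon map after a transient."""
    x, y = x0, y0
    out = np.empty(n)
    for t in range(n + discard):
        x, y = 1.0 - a * x * x + y, b * x   # simultaneous update of (x, y)
        if t >= discard:
            out[t - discard] = x
    return out

henon = henon_series(1000)
```

Because $y_t = b x_{t-1}$, the scalar series satisfies $x_{t+1} = 1 - a x_t^2 + b x_{t-1}$, which is why the pair $(x_t, x_{t-1})$ suffices to generate the attractor.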

We shall approximate one-dimensional series ($d_x = 1$) of length $T$ with the LRNN. Ten different $T$ values
($T = 10, 20, \ldots, 100$) shall be used. Matrix $\mathbf{H}$, which approximates the output from the hidden
activities of the LRNN, is a projection to the first coordinate.

Each coordinate of matrices $\mathbf{F}$, $\mathbf{G}$ was initialized by drawing it independently from the uniform
distribution over $[0, 1]$. The $\epsilon$-insensitivity was chosen as 0.05, that is, $\mathbf{R} = 0.05\mathbf{E}$.
The dimension of the hidden state was set to $d_u = 4$.

Normalization steps were applied to allow quantitative comparisons:

• Time series were scaled to vary between $[-1, 1]$.

• The $\lambda$'s of Eqs. (8) and (9) were chosen to allow for similar contributions for every term, independently from
the dimension of the hidden state of the RNN and the number of the adjustable parameters:

$$\lambda_{appr} = \frac{1}{T d_x}, \quad \lambda_{state} = \frac{1}{(T-1) d_u}, \quad \lambda_U = \frac{1}{T d_u}, \quad \lambda_F = \frac{1}{d_u^2}.$$

Optimizations for the sparse case [see Eq. (8)] were compared with optimizations using the quadratic cost function
[Eq. (9)]. There is a large diversity of possible mixed cost functions. No effort was made to select the main
contributing terms to the differences in our results. Such differences might change from problem to problem.
A. Results

The optimization warrants only locally optimal solutions. To mitigate this problem, in each particular experiment, 20
random initializations of matrices $\mathbf{F}$ and $\mathbf{G}$ were made and 20 different optimizations were
executed. Averages of the costs $f_q$ and $f_s$ were computed and shall be depicted.

Approximation error as a function of iteration number is shown in Fig. 2 for the case of the MG-30 problem. Results
are averaged over 20 experiments. Similar results were found in the other experiments, too. Time series of different
lengths were tried. According to the figure, sparsification has its advantages: it makes convergence faster and more
uniform.

The approximation produced small errors for the quadratic norm up to lengths of about 50. The iteration number
required for approximate convergence (the 'knees' of the curves) was below 10. For time series having lengths between
50 and 100, the error increased considerably and the typical iteration number required for approximate convergence
increased to 20. One may say that for time series up to 100 steps long and for the quadratic case, 20 iterations are
sufficient to reach stable approximation error, i.e., the close neighborhood of the local minimum.

For the sparse case, the situation is different. Convergence is very fast: 6 iterations were sufficient to reach
stable approximation error in all cases.

FIG. 2: Performance under quadratic cost and for the sparse case. Approximation error versus iteration number and the
length of the time series ($T$) approximated for the case of the MG-30 problem. Results are averaged over 20 random
experiments. (a): ($f_q$); the case for quadratic cost function. (b): ($f_s$); the sparse case. Convergence speed of
iteration is faster for the sparse case - note the differences in scales.

Predictions of the optimized LRNNs are shown in Fig. 3 for 100 time steps. Sparsification again shows improved
interpolating capabilities. This impression is reinforced by Fig. 4, which depicts averages of 20 randomly initialized
computations, similar to the one shown in Fig. 3. Averaged prediction errors, that is the
$\lambda_{appr} \|\mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X}\|_2$ and
$\lambda_{appr} \|\mathbf{H}(\mathbf{F}\mathbf{U} + \mathbf{G}\mathbf{X}) - z\mathbf{X}\|_{\mathbf{R}}$ values for the
quadratic and for the sparse case, are depicted in the figure. Respective standard deviations as well as the best and
the worst cases are also provided.

FIG. 3: Predictions of LRNN systems trained by different methods and for different problems. Problem set: MG-17, MG-30,
FIR-Laser, and Henon systems are shown in columns 1, 2, 3, and 4, respectively. (a): 'Quadratic', prediction for
quadratic cost function. (b): 'Sparse', prediction for the sparse case. Length of approximated time series: 100.
Black: original time series. Gray: approximation. For detailed results, see Fig. 4.

Table I shows small errors for the $\epsilon$-insensitive norm.

TABLE I: Time averaged prediction error for the $\epsilon$-insensitive cost.

MG-17                 MG-30                 FIR-Laser             Henon
$3.3 \cdot 10^{-4}$   $4.6 \cdot 10^{-4}$   $1.3 \cdot 10^{-4}$   $3 \cdot 10^{-4}$

According to Fig. 4, the deviation from the average is also small for long intervals, but - due to the nature of this
cost function, which measures costs relative to 0.05 - sometimes it is relatively large. The comparison of the
different costs is somewhat arbitrary. For time series generated by the MG-17, MG-30, FIR-Laser, and Henon systems, the
LRNN trained by joint parameter and representational sparsification seems to exhibit a factor of three better
approximation than the LRNN using the quadratic norm. Such a crude estimate can be gained by comparing 0.158 (the
square root of 0.025, which is an approximate low value for the error of the quadratic norm (Fig. 2)) with the value
of $\epsilon = 0.05$, an upper estimate for the sparse case. We have compared the normalized root mean square error
for the two norms: if the measured value is $o_t$ and the target is $x_t$, then

$$NRMSE := \frac{\sqrt{\langle (x_t - o_t)^2 \rangle}}{\sqrt{\langle (x_t - \langle x_t \rangle)^2 \rangle}},$$

where $\langle \cdot \rangle$ denotes averaging over $t$. For the quadratic and the $\epsilon$-insensitive cases we
obtained roughly NRMSE = 0.3 and NRMSE = 0.08, respectively. That is, in a comparison using NRMSE, which favors the
quadratic norm, there is an advantage of a factor of 3 for the $\epsilon$-insensitive norm.
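
The normalized root mean square error used in this comparison is straightforward to compute. The sketch below is a minimal illustration; the function name `nrmse` and the sample data are assumptions of this example.

```python
import numpy as np

def nrmse(target, output):
    """Root mean square error normalized by the standard deviation of the target."""
    target = np.asarray(target, dtype=float)
    output = np.asarray(output, dtype=float)
    rmse = np.sqrt(np.mean((target - output) ** 2))
    return rmse / np.std(target)      # np.std uses the population form by default

x = np.sin(np.linspace(0.0, 20.0, 200))
perfect = nrmse(x, x)                             # exact prediction
baseline = nrmse(x, np.full_like(x, x.mean()))    # always predicting the mean
```

By construction a perfect prediction yields 0 and the trivial predictor that always outputs the mean of the target yields 1, so values such as 0.3 or 0.08 can be read relative to that baseline.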

Finally, we note that long-term prediction is hard for the linear approach: apart from a zero-measure subset of
parameters, the states of linear recurrent networks either converge or diverge exponentially.
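
This remark can be illustrated numerically. When the output is fed back as input, Eqs. (1)-(3) reduce to $\mathbf{u}_{t+1} = (\mathbf{F} + \mathbf{G}\mathbf{H})\mathbf{u}_t$, so the long-term behavior is governed by the spectral radius of $\mathbf{F} + \mathbf{G}\mathbf{H}$. The sketch below rescales a random closed-loop matrix to spectral radius 0.9 and 1.1; the function names, sizes, and the rescaling are assumptions of this example.

```python
import numpy as np

def spectral_radius(A):
    """Largest absolute value among the eigenvalues of A."""
    return np.max(np.abs(np.linalg.eigvals(A)))

def state_norms(A, u0, steps):
    """Norms of u_t for the autonomous iteration u_{t+1} = A u_t."""
    u, norms = u0.copy(), []
    for _ in range(steps):
        norms.append(np.linalg.norm(u))
        u = A @ u
    return np.array(norms)

rng = np.random.default_rng(0)
d_u, d_x = 4, 1
F = rng.uniform(size=(d_u, d_u))
G = rng.uniform(size=(d_u, d_x))
H = np.eye(d_x, d_u)                  # projection to the first coordinate
A = F + G @ H                         # closed-loop matrix of Eqs. (1)-(3)
rho = spectral_radius(A)
u0 = rng.uniform(size=d_u)

decaying = state_norms(A * (0.9 / rho), u0, 500)   # spectral radius 0.9
growing = state_norms(A * (1.1 / rho), u0, 500)    # spectral radius 1.1
```

Only the borderline case of spectral radius exactly 1 allows sustained bounded oscillations, which is the zero-measure subset mentioned above.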



V. DISCUSSIONS 



First, let us mention that convergence of certain sparsification methods can be questioned [34]. The method
we presented makes use of the $L_1$ norm for sparsification, warrants convergence, and is an extension of previous
architectures, which did not consider a hidden predictive model.

We should address the following questions:

1. What is the advantage of sparse representation?

2. What is the advantage of weight sparsification, or pruning?

Question 1 is intriguing, because neuronal coding seems to apply sparsification as the basic strategy. Neuronal firing
in the brain is sparse, which could serve the minimization of energy consumption [35]. It is now well established that
joint constraints of sparsification and reconstruction (i.e., generative) capabilities produce receptive fields similar
to those found in the primary visual cortex [15]. More sophisticated architectures, which also apply sparsification,
can reproduce more details of the receptive fields [36].

Are there other advantages beyond energy minimization? Recently, independent component analysis (ICA) [37, ?, ?],
an information transfer optimizing scheme, has been related to sparsification. It has been shown that
ICA and appropriate thresholding correspond to well established denoising methods. Denoising, that is the removal
of the structureless high-entropy portion of the input, and the uncovering of structures are important tasks for
learning. Subliminal statistical learning is, indeed, a strategy applied by the brain [?]. ICA with denoising
capabilities, called sparse code shrinkage (SCS) [40], enables local noise estimation and local noise filtering. SCS
is most desirable, because it optimizes noise filtering with respect to the experienced inputs.

FIG. 4: Prediction error averaged over 20 experiments. Experiments are averaged over 20 randomly initialized and
optimized LRNN models for the 4 time series examples ((a): MG-17, (b): MG-30, (c): FIR-Laser, (d): Henon). Horizontal
axis: time $T$. Vertical axis: error. Solid line: averaged magnitude. Dashed line: standard deviation. Dotted line:
best and worst cases. Black: results for quadratic cost function. Gray: results for the sparse case. Prediction error
averaged over $T$ is on the order of $10^{-4}$ for the gray curves. See also Table I.

Another advantage emerges if sparsified representation is embedded into reconstruction networks. This construct
can produce overcomplete representations [15] which, by construction, exhibit graceful degradation. Also,
overcomplete representations can produce very different outputs for similar inputs, a most desirable feature for
categorization and decision making.

It is easy to see that the $L_1$ norm biases the system towards sparse representation compared to the $L_2$ norm: the
$L_1$ norm produces larger costs for small signals and depresses small signals more efficiently. This statement seems
more general, because the $\epsilon$-insensitive cost function is closely related to sparsification networks [?].

There are different noise models behind the $L_1$ and the $L_2$ norms. In such considerations, the norm of the
approximation is seen as the (negative) log-likelihood of the distribution of the noise. The $L_1$ and the $L_2$
norms, as well as the $\epsilon$-insensitive norm, are all incorporated into the following log-likelihood expression [43]:

$$V(x) = -\log \int_0^{\infty} d\beta \int_{-\infty}^{\infty} dt\, \sqrt{\beta}\, e^{-\beta (x - t)^2} P(\beta, t), \tag{15}$$

where $x$ is the argument of the respective norms. For example, the quadratic loss function emerges if
$P(\beta, t) = \delta(\beta - \frac{1}{2\sigma^2})\,\delta(t)$, where $\delta(\cdot)$ is Dirac's delta function. The
$L_1$ loss is recovered by setting $P(\beta, t) = \beta^{-2} \exp(-\frac{1}{4\beta})\,\delta(t)$. The expression for
the $\epsilon$-insensitive loss in a factorized form for $P(\beta, t)$ is as follows:
$P(\beta, t) = \frac{C}{\beta^2} \exp(-\frac{1}{4\beta})\,\lambda_\epsilon(t)$, where
$\lambda_\epsilon(t) = \frac{1}{2(\epsilon + 1)} \left(\chi_{[-\epsilon,\epsilon]}(t) + \delta(t - \epsilon) + \delta(t + \epsilon)\right)$,
$\chi_{[-\epsilon,\epsilon]}$ is the characteristic function of the interval $[-\epsilon, \epsilon]$, and $C$ is
a normalization constant. The interpretation is that the noise affecting the data is additive and Gaussian, but its
mean can be different from zero. The variance and mean of the noise are random variables with given probability
distributions. For diminishing range of the non-zero mean Gaussian distributions, the expression approximates the
$L_1$ norm [43]. In turn, both the $L_1$ norm and the $\epsilon$-insensitive norm correspond to super-Gaussian noise
distributions.

Concerning Question 2, sparsification of weight matrices, e.g., exponential forgetting, pruning of weights, as well
as other computational means have been studied over the years. The interested reader is referred to the literature on
this broad field [23]. Weight sparsification seems to have a particular advantage, called the Occam's razor principle.
In reverse engineering, such as the search for the (hidden) parameters of an LRNN, there are many solutions having
identical or similar properties. All of these could be the solution of the task. Which one is the right one? Or, at
least, which one is the most probable? The answer of the philosopher is that the simplest explanation, that is, the
one that makes the fewest assumptions, is the best, or the most probable. Similar ideas have emerged in information
theory, within the context of Kolmogorov complexity [44]. A bridge has been built to connect the idea of the least
number of assumptions and concepts of complexity. This is the minimum description length principle [45], which makes
use of complexity measures both for the data and for the assumed family of probability distributions, that is, for the
model. The best, i.e., the most probable description, has the minimum complexity. Important connections between
the minimum description length principle and regularization theory have been worked out in the literature [46, 47].
For a recent review on this subject, see [17]. One may think of weight sparsification as a method that decreases
the number of parameters and thus searches for simpler descriptions, provided that the precision of the parameters is
bounded.

Our results indicate that local minima are better distributed for the $L_1$ and/or $\epsilon$-insensitive norms than
for the $L_2$ norm within the problem set we studied. However, it is known that there should be other problem sets,
where the quadratic cost is superior to the absolute value [?]. It is easy to create an example: quadratic cost on the
parameters suits parameter sets that exhibit Gaussian distribution. In turn, the success of the $L_1$ and
$\epsilon$-insensitive norms raises the question about the type of problems that profit from this cost function family.

We need to examine the statistical properties of phenomena in nature. It seems that these phenomena have special
distributions. The distributions are super-Gaussian, sometimes exponential, and sometimes they have very heavy tails,
e.g., they can be characterized by a power-law distribution in a broad range. This looks typical for natural forms
(cf. fractals) and evolving/developing systems (see, e.g., [?] and references therein).

We have analyzed the four time series: Mackey-Glass (MG-17, MG-30), FIR-Laser, and Henon. The distributions of
distances between zero crossings of 10,000 step long time series were analyzed.

1. The kurtosis of the empirical distributions was computed according to

$$kurtosis(X) = \frac{E[(X - \mu)^4]}{\sigma^4} - 3,$$

where $E$ denotes expectation value, $\mu$ is the mean of variable $X$, and $\sigma$ is its standard deviation.
Positive kurtosis is typical for super-Gaussian distributions, which have heavy tails. Results are shown in Table II.
For all cases but the MG-17 series, positive kurtosis was found. It is known, however, that series MG-17 is at the
borderline of chaos: the Mackey-Glass time series becomes chaotic if $\tau > 16.8$.

TABLE II: Kurtosis of distances between zero-crossings

MG-17     MG-30     FIR-Laser     Henon
-0.92     1.11      101.8         2.89

2. The second study concerned the power-law behavior. It was found that the zero crossings for the Henon and
the MG-30 systems approximate power-law distributions, the MG-17 series does not, and the FIR laser is in
between: it exhibits an erratic descending curve on log-log scale. The slopes of the log-log fits were
smaller than $-1$, except for the MG-17 time series.
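
The zero-crossing statistics can be computed with a few lines of NumPy. The sketch below is illustrative: the function names and the test signals are assumptions of this example, and a zero crossing is taken to be a change of sign between consecutive samples.

```python
import numpy as np

def zero_crossing_distances(x):
    """Distances (in samples) between consecutive sign changes of the series x."""
    nonneg = np.asarray(x) >= 0
    crossings = np.flatnonzero(nonneg[:-1] != nonneg[1:])
    return np.diff(crossings)

def excess_kurtosis(values):
    """E[(X - mu)^4] / sigma^4 - 3, estimated from the sample."""
    v = np.asarray(values, dtype=float)
    mu, sigma = v.mean(), v.std()
    return np.mean((v - mu) ** 4) / sigma ** 4 - 3.0

# A sine wave of period 20 samples crosses zero every 10 samples.
t = np.arange(200)
distances = zero_crossing_distances(np.sin(2 * np.pi * (t + 0.5) / 20))

# A symmetric two-point distribution has excess kurtosis -2.
two_point = excess_kurtosis([-1.0, 1.0, -1.0, 1.0])
```

For a periodic signal all distances coincide and the kurtosis is undefined (zero variance); heavy-tailed distance distributions, as reported in Table II for the chaotic series, give positive values.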

The problems we studied typically have heavy-tailed distributions. In turn, the $L_1$ and/or the $\epsilon$-insensitive
norms suit such problems better than the $L_2$ norm, because the $L_1$ norm corresponds to the Laplace (i.e.,
double-sided exponential) distribution, which has a heavier tail than the Gaussian distribution. Power-law
distributions, however, are even more extreme in this respect. Because such heavy-tailed distributions are typical in
nature (see, e.g., [?]), one may find even better norms when identification of emergent behaviors is the issue.
VI. CONCLUSION

We have put forth a joint formalism for sparsification of both the representation and the structure of neural
networks. The formalism was applied to a recurrent neural network and the joint effect was studied. Experiments
were conducted on a number of benchmark problems, such as the Mackey-Glass time series with delay parameters 17 and 30,
the emission of a far-infrared laser, and the Henon time series. Our results suggest that the joint constraint on
sparsification is advantageous on these examples. Namely, optimizations were faster and local optima exhibited better
interpolating properties for the sparsified case than for the traditional quadratic norm. We have argued that these
findings are due to the underlying statistics of these phenomena: quadratic norms assume Gaussian noise, and
distributions with heavy tails are better approximated by norms with lower degrees. Joint advantages of the removal of
structureless high entropy, a feature of representational sparsification, and the discovery of structures with sparse
networks need to be further explored.



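APPENDIX A: DERIVATION OF EQ. (14)

$$\begin{aligned}
tr(\mathbf{B}\mathbf{X}^T\mathbf{C}\mathbf{Y}\mathbf{D}) &= tr\left[(\mathbf{X}\mathbf{B}^T)^T (\mathbf{C}\mathbf{Y}\mathbf{D})\right] \\
&= vec(\mathbf{I}\mathbf{X}\mathbf{B}^T)^T vec(\mathbf{C}\mathbf{Y}\mathbf{D}) \\
&= \left[(\mathbf{B} \otimes \mathbf{I})\, vec(\mathbf{X})\right]^T \left[(\mathbf{D}^T \otimes \mathbf{C})\, vec(\mathbf{Y})\right] \\
&= vec(\mathbf{X})^T \left[(\mathbf{B} \otimes \mathbf{I}_{size(\mathbf{X},1)})^T (\mathbf{D}^T \otimes \mathbf{C})\right] vec(\mathbf{Y}).
\end{aligned}$$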

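APPENDIX B: COMPUTING THE NORMS

$$\min_{\mathbf{Z}} f_1(\mathbf{Z}) \Leftrightarrow \min_{\{\mathbf{Z}, \mathbf{A}^{(*)}\}} \lambda\, tr\left[\mathbf{E}^T (\mathbf{A} + \mathbf{A}^*)\right], \text{ provided that } \begin{cases} \sum_{i=1}^{n} \mathbf{L}_i \mathbf{Z} \mathbf{M}_i - \mathbf{N} \leq \mathbf{R} + \mathbf{A}, \\ -\left(\sum_{i=1}^{n} \mathbf{L}_i \mathbf{Z} \mathbf{M}_i - \mathbf{N}\right) \leq \mathbf{R} + \mathbf{A}^*, \\ \mathbf{0} \leq \mathbf{A}^{(*)} \end{cases}$$

$$\Leftrightarrow \min_{\{\mathbf{z}, \mathbf{a}^{(*)}\}} \lambda\, vec(\mathbf{E})^T \left(vec(\mathbf{A}) + vec(\mathbf{A}^*)\right) =: \lambda \mathbf{e}^T (\mathbf{a} + \mathbf{a}^*), \text{ provided that } \begin{cases} \sum_{i=1}^{n} (\mathbf{M}_i^T \otimes \mathbf{L}_i)\, \mathbf{z} - \mathbf{a} \leq \mathbf{r} + \mathbf{n}, \\ -\sum_{i=1}^{n} (\mathbf{M}_i^T \otimes \mathbf{L}_i)\, \mathbf{z} - \mathbf{a}^* \leq \mathbf{r} - \mathbf{n}, \\ \mathbf{0} \leq \mathbf{a}^{(*)} \end{cases}$$

$$\Leftrightarrow \min_{\mathbf{y}} \mathbf{w}^T \mathbf{y}, \text{ provided that } \{\mathbf{D}\mathbf{y} \leq \mathbf{q}\}, \text{ where}$$

$$\mathbf{y} := \begin{bmatrix} \mathbf{z} \\ \mathbf{a} \\ \mathbf{a}^* \end{bmatrix}, \quad \mathbf{w} := \begin{bmatrix} \mathbf{0} \\ \lambda\mathbf{e} \\ \lambda\mathbf{e} \end{bmatrix}, \quad \mathbf{M}_L := \sum_{i=1}^{n} \mathbf{M}_i^T \otimes \mathbf{L}_i, \quad \mathbf{D} := \begin{bmatrix} \mathbf{M}_L & -\mathbf{I} & \mathbf{0} \\ -\mathbf{M}_L & \mathbf{0} & -\mathbf{I} \\ \mathbf{0} & -\mathbf{I} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} & -\mathbf{I} \end{bmatrix}, \quad \mathbf{q} := \begin{bmatrix} \mathbf{r} + \mathbf{n} \\ \mathbf{r} - \mathbf{n} \\ \mathbf{0} \\ \mathbf{0} \end{bmatrix}.$$

$$\begin{aligned}
f_2(\mathbf{Z}) &= \lambda\, tr\left[\left(\sum_{i=1}^{n} \mathbf{L}_i \mathbf{Z} \mathbf{M}_i - \mathbf{N}\right)^T \mathbf{K} \left(\sum_{j=1}^{n} \mathbf{L}_j \mathbf{Z} \mathbf{M}_j - \mathbf{N}\right)\right] \\
&= \lambda\, tr\left[\sum_{i,j} \mathbf{M}_i^T \mathbf{Z}^T (\mathbf{L}_i^T \mathbf{K} \mathbf{L}_j) \mathbf{Z} \mathbf{M}_j - 2 \sum_{j} \mathbf{N}^T \mathbf{K} \mathbf{L}_j \mathbf{Z} \mathbf{M}_j + \mathbf{N}^T \mathbf{K} \mathbf{N}\right] \\
&= \frac{1}{2} vec(\mathbf{Z})^T \left(2\lambda \sum_{i,j} \left\{ (\mathbf{M}_i^T \otimes \mathbf{I}_{size(\mathbf{Z},1)})^T \left[\mathbf{M}_j^T \otimes (\mathbf{L}_i^T \mathbf{K} \mathbf{L}_j)\right] \right\}\right) vec(\mathbf{Z}) \\
&\quad - 2\lambda \sum_{j} vec(\mathbf{L}_j^T \mathbf{K} \mathbf{N} \mathbf{M}_j^T)^T vec(\mathbf{Z}) + \lambda\, tr(\mathbf{N}^T \mathbf{K} \mathbf{N}) \\
&= \frac{1}{2} \mathbf{z}^T \mathbf{H} \mathbf{z} + \mathbf{f}^T \mathbf{z} \ (+\text{const}), \text{ where } \mathbf{z} \in \mathbb{R}^{size(\mathbf{Z},1) \cdot size(\mathbf{Z},2)}.
\end{aligned}$$

Optimizations can be executed by simply collecting the representative terms, because all terms are either in linear
or in quadratic forms.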
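
The quadratic form derived above can be verified numerically. The sketch below assembles the matrix and the vector of the quadratic form (named `Hq` and `f` here to avoid a clash with the observation matrix $\mathbf{H}$) and compares the result with a direct evaluation of $f_2$. It uses $(\mathbf{M}_i \otimes \mathbf{I})(\mathbf{M}_j^T \otimes \mathbf{L}_i^T\mathbf{K}\mathbf{L}_j) = (\mathbf{M}_i\mathbf{M}_j^T) \otimes (\mathbf{L}_i^T\mathbf{K}\mathbf{L}_j)$; all names, sizes, and the random data are assumptions of this example.

```python
import numpy as np

def vec(A):
    """Column-stacking vectorization."""
    return A.reshape(-1, order="F")

def quadratic_form(Ls, Ms, N, K, lam=1.0):
    """Return (Hq, f, const) with f_2(Z) = 0.5 z^T Hq z + f^T z + const, z = vec(Z)."""
    p, q = Ls[0].shape[1], Ms[0].shape[0]          # size(Z, 1), size(Z, 2)
    Hq = np.zeros((p * q, p * q))
    f = np.zeros(p * q)
    for Li, Mi in zip(Ls, Ms):
        f -= 2.0 * lam * vec(Li.T @ K @ N @ Mi.T)
        for Lj, Mj in zip(Ls, Ms):
            Hq += 2.0 * lam * np.kron(Mi @ Mj.T, Li.T @ K @ Lj)
    const = lam * np.trace(N.T @ K @ N)
    return Hq, f, const

def f2_direct(Z, Ls, Ms, N, K, lam=1.0):
    """Direct evaluation of lam * tr(E^T K E) with E = sum_i L_i Z M_i - N."""
    E = sum(L @ Z @ M for L, M in zip(Ls, Ms)) - N
    return lam * np.trace(E.T @ K @ E)

rng = np.random.default_rng(0)
Ls = [rng.standard_normal((5, 2)) for _ in range(2)]
Ms = [rng.standard_normal((3, 4)) for _ in range(2)]
N = rng.standard_normal((5, 4))
K = np.eye(5)
Z = rng.standard_normal((2, 3))

Hq, f, const = quadratic_form(Ls, Ms, N, K)
z = vec(Z)
quad_value = 0.5 * z @ Hq @ z + f @ z + const
direct_value = f2_direct(Z, Ls, Ms, N, K)

# Minimizer of the quadratic step: Hq z = -f.
z_star = np.linalg.solve(Hq, -f)
Z_star = z_star.reshape(Z.shape, order="F")
```

Here the problem is overdetermined (20 residual entries, 6 unknowns), so `Hq` is generically positive definite and the linear solve gives the unique minimizer of the quadratic step.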